Workflow: Basics and Data Transformation

Module 02

Ray J. Hoobler

Workflow Basics (Chapter 2, R for Data Science)

Coding basics: Review R for Data Science, Chapter 2 (1/4)

You can use R to do basic math calculations:

Code
1 / 200 * 30
[1] 0.15


Code
(59 + 73 + 2) / 3
[1] 44.66667


Code
sin(pi / 2)
[1] 1

Coding basics: Review R for Data Science, Chapter 2 (2/4)

Create new objects with <-

Code
x <- 3 * 4
x
[1] 12


Combine multiple elements into a vector with c():

Code
primes <- c(2, 3, 5, 7, 11, 13)
primes
[1]  2  3  5  7 11 13

Coding basics: Review R for Data Science, Chapter 2 (3/4)

Basic arithmetic on vectors is applied to every element of of the vector:

Code
primes * 2
[1]  4  6 10 14 22 26
Code
primes - 1
[1]  1  2  4  6 10 12

Coding basics: Review R for Data Science, Chapter 2 (4/4)

Comments

R will ignore any text after # for that line.

Code
# create vector of primes
primes <- c(2, 3, 5, 7, 11, 13)

# multiply primes by 2
primes * 2
[1]  4  6 10 14 22 26

What’s in a name?

```{r}
i_use_snake_case
otherPeopleUseCamelCase
some.people.use.periods
And_aFew.People_RENOUNCEconvention
```


  • snake case has few drawbacks
  • camel case may be more easy to type; note the first word is typically lowercase
  • periods may cause issues with some functions (not considered a best practice in Python)

Calling functions

```{r}
function_name(argument1 = value1, argument2 = value2, ...)
```


Code
seq(from = 1, to = 10, by = 2)
[1] 1 3 5 7 9


Code
seq(1, 10, 2)
[1] 1 3 5 7 9

Module 2 Assignment 1

Complete the exercises in section 2.5 of R for Data Science.

Data Transformation (Chapter3, R for Data Science)

Libraries Needed for the Following Slides

Code
library(tidyverse)
library(palmerpenguins) 

dplyr basics

dplyr functions have the following characteristics:

  1. The first argument is a data frame.
  2. The subsequent arguments describe what to do with the data frame.
  3. The output is a new data frame.

Because the input and output are both data frames, you can string together multiple dplyr functions with the pipe operator |>.

Note

Previous versions of dplyr used the %>% operator which was imported from the magrittr package. The |> operator is now the preferred operator for piping in dplyr and is native to R.

dplyr Functions for Transforming Rows (1/2)

  • filter()
  • arrange()
  • distinct()

dplyr Functions for Transforming Rows (2/2)

  • filter() : Extracts rows that meet a logical condition.
  • arrange() : Reorders rows.
  • distinct() : Finds all unique rows, usually based on a subset of columns.

filter()

Filter the penguins data frame to include only observations where the species is “Adelie”.

Code
penguins |> filter(species == "Adelie") |> tail()

We place the species name in quotes because it is a character string.

Note

Factors are used for categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.

Use filter() to find outliers

Use body_mass_g variable of penguins data

EDA Boxplot of penguins data

Code
penguins |> ggplot(aes(x = species, y = body_mass_g)) + geom_boxplot()

Based on the boxplot, only the Chinstrap species has outliers.

Outliers are observations have a value of 1.5 times the IQR. (\(IQR = Q_3 - Q_1\))

\[ \text{x} \begin{cases} \ge Q_3 + 1.5 \times IQR \\ \le Q_1 - 1.5 \times IQR \end{cases} \]

Find the Outliers and Apply the Filter (1/2)

Code
# Create a new data frame of the Chinstrap species
chinstrap <- penguins |> filter(species == "Chinstrap")

# Find the IQR, Q1 and Q3 values using stats functions
iqr_chinstrap <- IQR(chinstrap$body_mass_g, na.rm = TRUE)
q1_chinstrap <- quantile(chinstrap$body_mass_g, 0.25, na.rm = TRUE)[[1]]
q3_chinstrap <- quantile(chinstrap$body_mass_g, 0.75, na.rm = TRUE)[[1]]
Code
sprintf("IQR: %.2f", iqr_chinstrap)
[1] "IQR: 462.50"
Code
sprintf("Q1: %.2f", q1_chinstrap)
[1] "Q1: 3487.50"
Code
sprintf("Q3: %.2f", q3_chinstrap)
[1] "Q3: 3950.00"

Notice the chinstrap$body_mass_g variable is used in the IQR() and quantile() functions. This is a common way to access variables in a data frame. Notice the na.rm = TRUE argument in the functions. This argument removes missing values from the calculations.

These functions are not part of the dplyr package, but are part of the stats package which is loaded when R starts

I’m also using the sprintf() function to format the output. The %.2f argument tells R to format the output as a floating point number with two decimal places. Check out the ?sprintf help file for more information on formatting output. We don’t often use this function with Markdown, but these types of functions are common in other languages like C and Python.

Find the Outliers and Apply the Filter (2/2)

Outliers in the penguins data frame are observations where the species is “Chinstrap” and the body_mass_g is greater than or equal to 1.5 times the IQR or less than or equal to 1.5 times the IQR.

This is \(\ge\) 4.64kg and \(\le\) 2.79kg, respectively.

Code
# Use IQR in filter function
penguins_outliers <- chinstrap |>
  filter(body_mass_g >= q3_chinstrap + 1.5 * iqr_chinstrap | body_mass_g <= q1_chinstrap - 1.5 * iqr_chinstrap
         )

penguins_outliers

Using the arrange() function to reorder rows

Code
# Reorder the penguins_outliers data frame by body_mass_g
penguins_outliers |> arrange(body_mass_g)
Code
# Reorder the penguins_outliers data frame by body_mass_g in descending order
penguins_outliers |> arrange(desc(body_mass_g))

(Note: the original order was simply by order of appearance in the penguins data frame.)

Using the distinct() function (1/2)

To find unique rows in a data frame

Code
# Find unique rows in the penguins data frame
penguins |> distinct() 
Code
nrow(penguins)
[1] 344

Using the distinct() function (2/2)

Unique pairs of species and island

Code
penguins |> distinct(species, island)

Module 2 Assignment 2

Answer question 4 in section 3.2.5 Exercises from R for Data Science (2e)

Code
library(nycflights13)

# Find the number of unique flights in the nycflights13 data frame
flights |> distinct(month, day) |> nrow()
[1] 365

dplyr Functions for Transforming Columns (1/4)

mutate()

Mutate may be the most powerful tool in the tidyverse.

We can use mutate() to create new columns or modify existing columns.

Using mutate() to create a new column (1/3)

Simple mathematical transformation of an existing column: convert body_mass_g to kilograms.

Code
penguins_kg <- penguins |> 
  mutate(body_mass_kg = body_mass_g / 1000, .before = body_mass_g)

penguins_kg

Using mutate() to create a new column (2/3)

We can use mutate() and use existing columns to create new columns.

Code
penguins_ratio <- penguins |> 
  mutate(bill_ratio = bill_length_mm / bill_depth_mm, .before = bill_length_mm)

penguins_ratio

Using mutate() to create a new column (3/3)

We can apply logical conditions to create new columns.

Code
penguins_binary <- penguins |> 
  mutate(island_binary = if_else(island == "Dream", 1, 0), .before = island)

penguins_binary

Keep mutate() in mind when you ask the question. . .

Do I need to change something in my data frame?

Using select() to keep or drop columns (1/4)

Probably the most straightforward function in dplyr.

Simply list the columns you want to keep, or…

Code
penguins_select <- penguins |> 
  select(species, island, body_mass_g, sex, year)

penguins_select

Using select() to keep or drop columns (2/4)

…list the columns you want to drop.

Code
penguins_drop <- penguins |> 
  select(-bill_length_mm, -bill_depth_mm, -flipper_length_mm)

penguins_drop

Using select() to keep or drop columns (3/4)

You can also use the index of the column and use the : operator to select a range of columns.

Code
penguins_select_index <- penguins |> 
  select(1:2, 6:8)

# or use the ! (not) operator to drop columns

penguins_select_index2 <- penguins |> 
  select(!3:5)

penguins_select_index
Code
penguins_select_index2

Using select() to keep or drop columns (4/4)

See ?select for more details. Once you know regular expressions (the topic of Chapter 15) you’ll also be able to use matches() to select variables that match a pattern.

You can rename variables as you select() them by using =. The new name appears on the left hand side of the =, and the old variable appears on the right hand side:

Code
penguins_rename <- penguins |> 
  select(Species = species, Island = island, "Body Mass (g)" = body_mass_g)

penguins_rename2 <- penguins |> 
  rename(Species = species, Island = island, "Body Mass (g)" = body_mass_g)

penguins_rename
Code
penguins_rename2

Using relocate() to move columns

relocate() is used to move columns to a new position in the data frame.

By default, variables are moved to the first position; however, the .after or .before arguments can be used to specify a new position.

Code
penguins_relocate <- penguins |>
  relocate(body_mass_g, .after = island)

penguins_relocate

The pipe operator |>

We can string together multiple dplyr functions with the pipe operator |>.

Let’s find the five largest penguins by body mass, convert the mass to kg, and create a table showing the species, island, and body mass. Finally, we can rename the variables to make the table more readable.

Code
penguins |>
  mutate(body_mass_kg = body_mass_g / 1000) |>
  arrange(desc(body_mass_kg)) |>
  select(species, island, body_mass_kg) |>
  rename(Species = species, Island = island, "Body Mass (kg)" = body_mass_kg) |>
  head(5)

For exploratory data analysis, we may not save the table as a new object.

group_by() and summarize()

  • group_by() : Group data by one or more variables.
  • summarize() : Summarize data by collapsing each group into a single row.

Using group_by()

Group the penguins data frame by species and island.

Look carefully at the new data frame. What’s changed?

Code
penguins_grouped <- penguins |> 
  group_by(species, island)

penguins_grouped

The data frame has a new class attribute: “Groups”

Using summarize() on grouped data

Summarize the grouped data frame by calculating the mean body_mass_g for each

Code
#| code-fold: show

penguins_grouped |> 
  summarize(mean_body_mass_g = mean(body_mass_g, na.rm = TRUE))

Using group_by() and summarize() together

Group the penguins data frame by species and island and summarize the data by calculating the mean body_mass_g for each.

Code
penguins_grouped_mean <- penguins |> 
  group_by(species) |> 
  summarize(
    count = n(), 
    mean_body_mass_g = mean(body_mass_g, na.rm = TRUE)
    ) |> 
  ungroup()

penguins_grouped_mean

Module 2 Assignment 3

Ask a question about the penguins data frame and use dplyr functions to answer it.

Use the final output of your analysis to create a table and ggplot2 visualization. For your visualization, include a title, subtitle, axis labels, and a caption.

Code
# Question: What is the average body mass of penguins by Species and Island?
penguins_analysis <- penguins |> 
  group_by(species, island) |> 
  summarize(
    count = n(), 
    mean_body_mass_g = mean(body_mass_g, na.rm = TRUE)
    ) |> 
  ungroup() |> 
  rename(Species = species, Island = island, Count = count, "Mean Body Mass (g)" = mean_body_mass_g)

penguins_analysis
Code
penguins_analysis |> 
  ggplot(aes(x = Island, y = `Mean Body Mass (g)`/1000, fill = Species)) + 
  geom_col(position = position_dodge(preserve = "single")) + 
  labs(title = "Average Body Mass of Penguins by Island",
       subtitle = "Data from the palmerpenguins package",
       x = "Island",
       y = "Mean Body Mass (kg)",
       caption = "Data source: palmerpenguins package") + 
  theme_classic()

End of Module 2

References